Methods for the Classification of Data from Open-Ended Questions in Surveys

Disputation
16 April 2024

Camille Landesvatter

University of Mannheim

Research Questions and Motivation

Which methods can we use to classify data from open-ended survey questions?
Can we leverage these methods to make empirical contributions to substantial questions?

Motivation:

1️⃣ Increase in methods to collect natural language (e.g., smartphone surveys with voice technologies) requires the evaluation of automated, ML-based methods.

2️⃣ Special structure of open-ended survey answers (e.g., shortness, lack of context) requires the testing of traditional and recent methods, e.g., word embeddings.

Methods for Analyzing Data from Open-Ended Questions

Table 1. Overview of methods for classifying open-ended survey responses. Source: Own depiction.

Overview of Studies

Study 1: How valid are trust survey measures? New insights from open-ended probing data and supervised machine learning
Study 2: Open-ended survey questions: A comparison of information content in text and audio response formats
Study 3: Asking Why: Is there an Affective Component of Political Trust Ratings in Surveys?

How valid are trust survey measures? New insights from open-ended probing data and supervised machine learning

Landesvatter, C., & Bauer, P. C. (2024). How Valid Are Trust Survey Measures? New Insights From Open-Ended Probing Data and Supervised Machine Learning. Sociological Methods & Research, 0(0). https://doi.org/10.1177/00491241241234871

Study 1: Characteristics

  • Background: ongoing debates about which type of trust survey researchers are measuring with traditional survey items (i.e., the equivalence debate, cf. Bauer & Freitag 2018)

  • Research Question: How valid are traditional trust survey measures?

Study 1: Methodology

  • Operationalization via two classifications: (I) share of known vs. unknown others in associations; (II) sentiment (positive-neutral-negative) of associations

  • Supervised classification approach:

      1. manual labeling of randomly sampled documents (n=[1,000/1,500])
      2. fine-tuning the weights of two BERT models (base model, uncased version), using the manually coded data as training data, to classify the remaining n=[6,500/6,000]
  • Data: U.S. non-probability sample; n=1,500
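The two-step approach above (manually label a random sample, then train a classifier on those labels to code the remainder) can be illustrated in simplified form. The sketch below substitutes a trivial keyword-count classifier for BERT and uses made-up toy data; it demonstrates only the label-split-train-evaluate workflow, not the study's actual model or data.

```python
import random
from collections import Counter

# Hypothetical labeled sample: (response_text, label) pairs.
# In the study, ~1,500 responses were manually coded; here we use toy data.
labeled = [
    ("my family and close friends", "known"),
    ("friends and my neighbors", "known"),
    ("strangers on the street", "unknown"),
    ("people I have never met", "unknown"),
] * 50  # repeat to mimic a larger coded sample

random.seed(0)
random.shuffle(labeled)
split = int(0.8 * len(labeled))        # 80/20 train / hold-out split
train, test = labeled[:split], labeled[split:]

# Stand-in classifier: score responses by keyword counts learned from training data.
keyword_counts = {"known": Counter(), "unknown": Counter()}
for text, label in train:
    keyword_counts[label].update(text.split())

def predict(text):
    # Assign the label whose training vocabulary best matches the response.
    scores = {lab: sum(cnt[w] for w in text.split())
              for lab, cnt in keyword_counts.items()}
    return max(scores, key=scores.get)

# Accuracy on the hold-out set mirrors the evaluation step before the trained
# model is applied to the remaining unlabeled documents.
accuracy = sum(predict(t) == y for t, y in test) / len(test)
print(f"hold-out accuracy: {accuracy:.2f}")
```

On this toy data the classifier is perfect by construction; real open-ended answers are far noisier, which is why the study moves from simple baselines to fine-tuned BERT.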

Study 1: Results

Table 2: Illustration of exemplary data. Note: n=7,497.

Figure 1: Trust Scores by Associations for the Most People Question.
Note: CIs are 90% and 95%, n=1,499.

Open-ended survey questions: A comparison of information content in text and audio response formats

Landesvatter, C., & Bauer, P. C. (February 2024). Open-ended survey questions: A comparison of information content in text and audio response formats. Working Paper submitted to Public Opinion Quarterly.

Study 2: Characteristics

  • Background: requests for spoken answers are assumed to trigger open narration with more intuitive and spontaneous answers (e.g., Gavras et al. 2022)

  • Research Question: Are there differences in information content between responses given in voice and text formats?

  • Experimental Design: random assignment into either the text or voice condition

Study 2: Methodology

  • Operationalization via application of measures from information theory and machine learning to classify open-ended survey answers
    • response length, number of topics, response entropy
  • Data: U.S. non-probability sample; n=1,461
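Of the measures above, response entropy can be computed directly from the word distribution of each answer. The following is a minimal illustration of Shannon entropy over word frequencies, not the study's exact implementation (which may tokenize and preprocess differently):

```python
import math
from collections import Counter

def response_entropy(text):
    """Shannon entropy (in bits) over the word distribution of a response.

    Higher values mean the answer spreads its probability mass over more
    distinct words, i.e., carries more varied content.
    """
    words = text.lower().split()
    counts = Counter(words)
    total = len(words)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# A repetitive answer has lower entropy than a varied one of the same length.
low  = response_entropy("yes yes yes yes")                       # 0.0 bits
high = response_entropy("jobs economy healthcare education")     # 2.0 bits
print(round(low, 2), round(high, 2))
```

Together with response length and the number of topics, such measures allow the text and voice conditions to be compared on a common information-content scale.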

Study 2: Results

Figure 2: Information Content Measures across Questions.
Note. CIs are 95%, n_vote-choice: 830 (audio: 225, text: 605), n_future-children: 1,337 (audio: 389, text: 748)

Asking Why: Is there an Affective Component of Political Trust Ratings in Surveys?

Landesvatter, C., & Bauer, P. C. (March 2024). Asking Why: Is there an Affective Component of Political Trust Ratings in Surveys? Working Paper submitted to American Political Science Review.

Study 3: Characteristics

  • Background: the conventional notion that trust originates from informed, rational, and consequential judgments is challenged by the idea of an “affect-based” form of (political) trust (e.g., Theiss-Morse and Barton 2017)

  • Research Question: Are individual trust judgments in surveys driven by affective rationales?

  • Data: U.S. non-probability sample; n=1,276

Study 3: Methodology

Figure: Methods for Sentiment and Emotion Analysis.

Study 3: Results

Figure 3: Emotion Recognition for Speech Data with SpeechBrain. Note. CIs are 95%, n_neutral=408, n_anger=44, n_sadness=18, n_happiness=21.

Summary

  • Web surveys make it possible to collect narrative answers that provide valuable insights into survey responses
    • think aloud, associations, emotions, tonal cues, additional info, etc.
  • New technologies (smartphone surveys, speech-to-text algorithms) can be used to collect such data in innovative ways
  • Analyzing natural language can inform various debates, e.g.:
    • Study 1: equivalence debate in trust research
    • Study 2: survey questionnaire design research
    • Study 3: cognitive-versus-affective debate in political trust research
    • Study 1-3: item and data quality in general (e.g., associations, information content, sentiment, emotions)

Machine Learning and Open-ended Answers

Large language models (LLMs) facilitate the accessibility and implementation of semi-automated methods.

  • traditional semi-automated methods, such as supervised ML, are helpful and appealing, but they require sufficient and high-quality training data (i.e., labeled examples)

  • E.g., Study 1: Random Forest with 1,500 labeled examples versus BERT

  • this can be a challenge for survey researchers when surveys don’t provide thousands of documents

  • LLMs allow researchers to leverage powerful language-processing capabilities without having to build complex systems from scratch

Machine Learning and Open-ended Answers

Fine-tuning pre-trained models can be valuable for classifying domain-specific data.

  • fine-tuning requires little resources and can add domain-specific context
    • Study 1: Fine-tuning with ~20% of documents (n=1,500) achieves high accuracy (95%) on the “known-unknown others” classification
  • But: consider the complexity and limited transparency of these models
    • always start with simple methods and evaluate
      • Study 1: Random Forest → BERT
      • Study 3: dictionary approach → deep learning
    • accuracy-explainability trade-off
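The “start simple” principle above is easy to see with a dictionary approach: a transparent word-list scorer of the kind worth evaluating before reaching for deep learning. The word lists below are tiny illustrative stand-ins, not a validated sentiment lexicon:

```python
# Minimal dictionary-based sentiment scorer. Every classification is fully
# explainable: the label follows directly from which listed words occur.
POSITIVE = {"good", "honest", "fair", "trustworthy", "reliable"}
NEGATIVE = {"corrupt", "dishonest", "selfish", "broken", "liars"}

def dictionary_sentiment(text):
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(dictionary_sentiment("politicians are corrupt and dishonest"))
print(dictionary_sentiment("they are mostly honest and fair"))
```

The trade-off is exactly the one named above: such a scorer is maximally transparent but misses negation, sarcasm, and out-of-vocabulary wording, which is where fine-tuned deep learning models gain their accuracy at the cost of explainability.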

Machine Learning and Open-ended Answers

Increasing number of possibilities to reduce manual input to a minimum.

  • Study 3: zero-shot prompting yields findings similar to those of fine-tuned pre-trained models (e.g., 80% overlap between GPT prompting and pysentimiento)
  • deciding on a suitable number of manual examples depends on various factors such as the task difficulty
    • few-shot versus zero-shot prompting
  • the less manual input, the more important the manual inspection of results becomes (e.g., Study 2: what are high-entropy documents?)
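In practice, the zero-shot vs. few-shot distinction above boils down to whether labeled examples are included in the prompt sent to the model. A minimal sketch (the prompt wording, example answers, and `build_prompt` helper are my own illustration, not the study's actual prompts):

```python
def build_prompt(answer, examples=None):
    """Build a classification prompt; pass labeled examples for few-shot use."""
    lines = ["Classify the sentiment of the survey answer as "
             "positive, neutral, or negative."]
    # Few-shot: prepend manually labeled examples; zero-shot: skip this loop.
    for text, label in (examples or []):
        lines.append(f'Answer: "{text}"\nSentiment: {label}')
    lines.append(f'Answer: "{answer}"\nSentiment:')
    return "\n\n".join(lines)

zero_shot = build_prompt("I simply do not trust them anymore.")
few_shot = build_prompt(
    "I simply do not trust them anymore.",
    examples=[("They mostly do a good job.", "positive"),
              ("No opinion either way.", "neutral")],
)
print(zero_shot)
```

The resulting string would be sent to an LLM API of the researcher's choice; how many examples to include depends, as noted above, on factors such as task difficulty.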

Fully manual, semi-automated, or fully automated?

The final choice among these approaches depends on:

  • difficulty of the given task (e.g., general versus specific codes)

  • size of the available dataset (e.g., n, splits by experimental conditions)

  • structure of the open answers (e.g., length, amount of context → this depends on the question design)

  • the amount and state of previous research (e.g., available code schemes)

  • desired accuracy and desired transparency

  • available resources (e.g., human power, computational power (GPU), time resources)

Thank you for your attention!